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Abstract 


Given photometric broadband measurements of a galaxy, Gaussian pro- 
cesses may be used with a training set to solve the regression problem of approx- 
imating the redshift of this galaxy. However, in practice solving the traditional 
Gaussian processes equation is too slow and requires too much memory. We em- 
ployed several methods to avoid this difficulty using algebraic manipulation and 
low-rank approximation, and were able to quickly approximate the redshifts 
in our testing data within 17 percent of the known true values using limited 
computational resources. The accuracy of one method, the V Formulation, is 
comparable to the accuracy of the best methods currently used for this problem. 
However, the V Formulation is quicker, and its numerical stability is superior. 

If 10 percent of outliers are removed from the testing data, the error 
is reduced from 17 percent to 12 percent. A necessary condition was found 
for identifying these outliers, that the norm of a collection of redshifts corre- 
sponding to these outliers is comparatively large when compared to the norm 
of a randomly selected group of redshifts. Moreover, viewing characterization 
of outliers as a classification problem, it was possible to reduce the error to 
approximately 15 percent. 
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1 Redshift and photometric observations 

Let y = (yi, ...,y n ) T be a sequence of redshift observations of n galaxies. 
In Way and Srivastava’s [Way and Srivastava, 2006] observation set which is 
based on the SLOAN sky survey, for each galaxy i there are five photometric 
observations denoted as U, G , i?, /, Z. We use X to denote an n x 5 matrix 
such that the ith row x t of X represents the five photometric observations 
corresponding to galaxy i, and we let U, G , R, I, Z denote the columns of X in 
their respective order. The goal is to fit a model from given observations X and 
y such that using this model, we can find a prediction y* of the redshift y* of 
some galaxy after inserting the galaxy’s photometric observations X* into the 
fitted model. 


1.1 What is a Redshift? 

The redshift of a galaxy, z, is the change in the wavelength divided by the 
initial wavelength of the electromagnetic waves that are emitted by the galaxy. 
Hence, the redshift of z is A redshift of a galaxy indicates that it is moving 
away from the earth. By calculating the redshift of a galaxy, scientists can 
determine many characteristics of that galaxy and the universe. For example, 
a redshift can determine the distance between the galaxy and the earth. 

There are two methods that are used to collect the data needed to calcu- 
late a red shift: photometry and spectroscopy. Photometry uses multiple filters, 
each designed to collect particular wavelengths of the electromagnetic spectrum. 
These filters collect data from 5 band passes - U, G , R , /, and Z , which range 
from the ultraviolet to the infared. Broadband photometry is designed to collect 
5 pieces of data (one from each band pass) which comes from many objects, in 
our case galaxies, in a particular region of space. Spectroscopy, on the other 
hand, uses a diffraction grating which will split the light emitted from the galax- 
ies into the different wavelengths to collect the spectral data. Spectroscopy is 
often preferred because there it collect more data from its object and therefore it 
is more accurate. However, since photometry is cheaper and faster, and because 
it collects more data at one time, it was the preferred method used by Dr. Way 
and Dr. Srivastava in their research [Way and Srivastava, 2006], upon which 
the research in this paper is based. 

There are two main approaches used to calculate photometric redshifts. 
The spectral energy distribution fitting (SED fitting), also known as template 
fitting, compares the spectral energy distribution converted from the observed 
data with the spectral energy distribution of a known template. The training- 
set method (TS method) uses the redshifts calculated of other similar galaxies 
and the newly observed data to calculate the redshifts of these new galaxies. 
The SED fitting approach has typically been preferred because the TS method 
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requires samples of galaxies that are similar in magnitude, color, and red shift. 
These samples have not always been available because the telescopes could not 
reach as far into space. However, now that the surveys can go deeper into 
space, the TS method can be used because there are more samples of galaxies 
with the same characteristics. Dr. Way and Dr. Srivastava described various 
TS methods that have been recently used and developed in their report, that 
enabled them to estimate the photometric redshifts of galaxies. Using the TS 
method, the scientists wanted to find the best mathematical model for the red- 
shift data. Some of the techniques they used, both linear and non-linear, were 
polynomial fittings, support vector machines, and artificial neural networks. In 
particular, the scientists compared the neural network ANNz model, the linear 
and quadratic models, the ensemble model, and Gaussian processes to find the 
best model for the red shifts. The Gaussian process regression was the focus of 
our problem. 


2 Gaussian Processes 

The set of 180,045 measurements through the 5 broadband spectrums 
(UGRIZ) are given by a 180,045 x 5 matrix A , and the corresponding red- 
shift measurements are stored in a a 180, 045 x 1 vector y. Together, these form 
the training dataset. A second set of 20, 229 more measurements X* along with 
20,229 corresponding redshifts stored in y* together form the testing dataset. 
This second set was the testing data, and in practice would be a set of broad- 
band spectrum values whose redshifts were unknown. For the purposes of our 
research the measured redshifts y* were known. The goal is to predict the value 
of y* given A, y, and X* using Gaussian processes. 

The prediction of y* involved both covariances between rows from X 
and itself, forming a covariance matrix K(x-i, x 3 ) and between rows from X and 
X*, forming a covariance matrix K*(x*,xj). The prediction y* of y* is given 
by the traditional Gaussian processes equation, shown below: 

y* = K*{\ 2 1 + K)~ l y 


The parameter A in this equation represents the noise in the measurements 
[Rasmussen and Williams, 2006, pp. 16-17] and in practice it is often selected 
to improve the quality of the model [Rasmussen and William, 2006, Chap. 5]. 

It is not completely clear how to choose the kernel function K , and 
there exist many different kernel functions that apply broadly to many cases. 
Last semester, the polynomial kernel was used, a function that involved matrix 
multiplication of X and X* to generate K. Namely, the kernel took the form 
K = (A 2 1 + AA T )72, where the symbol 7 here indicates componentwise square 
(so that if B = (A 2 1 + XX T ) then kij = bij 2 ). Since A is a 180, 045 x 5 matrix, 
K is a 180,045 x 180,045 matrix. This semester, the study went further by 
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experimenting with many other kernels. Last semester four methods that were 
used and perfected were the reduced rank method, the Cholesky update, the 
conjugate gradient, and the classic method, quadratic regression. We refer the 
reader to last semester’s report for detail of these methods [Cayco, So, et. al., 
2006] 


3 Gibbs Sampler 

The Gibbs Sampler is another method of solving a system of linear equa- 
tions. The Gibbs Sampler is based off of the Monte Carlo idea, a process of 
finding the expected value of a population by creating a random sample of the 
population. Given that this random sample is representative of the population, 
the Monte Carlo idea states that the average of this sample is the expected value 
of the population. The best way to ensure that this sample is representative is to 
take a large sample. In the case of redshifts however, this process is translated 
to vectors and matrices. The population in this case is the covariance matrix 
A, and the random sample will be a set of vectors. 

This set of random vectors will form a matrix S whose covariance matrix 
will ideally be A _1 . The Gibbs Sampler works best in situations where large 
tolerance for error is allowed and standard procedures are too slow. Other 
factors can come into play however, one of which is the condition ratio. The 
condition ratio of A is the largest eigenvalue of A divided by the smallest, and the 
smaller the condition ratio, the more accurate the Gibbs Sampler will be. It was 
seen last semester that by generating A by a polynomial kernel, the condition 
ratio was too large for the Gibbs Sampler to work. So instead this semester the 
squared-exponential kernel was used for the majority of experimentation of the 
Gibbs Sampler to better success. 

The Gibbs Sampler code used in this project was a computer algorithm 
put forth by [Geman and Geman, 1984]. Given A, the elements of S must be 
randomly selected such that they accurately represent A. This is done with a 
series of dot products and matrix multiplication. S is denoted by a series of 
vectors s*, the more vectors, the better. These vectors are generated one entry 
at a time and are dependent both on A, whose columns are denoted as Vi, and 
the vectors of S that have already been created. To ensure that the vectors are 
representative of A, the mean and standard deviation of the entry that is being 
created are computed first. The algorithm put forth by Geman and Geman was 
that the entry of S that lies in the kth column and ith row of S has a mean of: 

i -\- 1 Tl 

Tin = y ' VijSjk -\- y ^ VijSjk— i 

3= 1 j=i + 1 
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The mean is found by taking the semi-dot product of the first i-1 entries 
of the jth column of A with the entries in the kth column of S that have already 
been created, as well as the last n-(i+l) entries of the jth column of A with the 
entries in the column of S previous to the kth column. The entire product is 
then multiplied by the negative reciprocal of the ith diagonal entry of A. Thus, 
this entire process doesn’t require that A be completely computed beforehand 
either. However, it should be noted that each new vector of S depends on 
the vector that came before it, thus to avoid confusion on generating the first 
vector of S, a column of zeroes is automatically made the first column of S. The 
standard deviation is simply computed as: 


t = 


1 


After the mean and standard deviation are computed, the entry in the 
ith row and kth column of S is computed to be Sik = tz + m, where z is a ran- 
dom number created by a pseudo-random number generator at standard normal 
distribution, as most computer systems have such a generator. The process is 
repeated until all the vectors (and columns of S) have been generated, and the 
covariance of S is computed to be ( SS"T)/k where k is the number of vectors 
that were generated to form S. For the various different kernels that were sam- 
pled, the one that was used for the majority of the study of the Gibbs Sampler 
was the squared-exponential kernel. However, despite the improvements on the 
Gibbs Sampler from last year’s studies, obstacles still stood in the way. 

The biggest obstacle was that of time. Time was one of the primary 
reasons why alternate methods to quadratic regression were sought after, but the 
Gibbs Sampler in this form was not a time-efficient code. Due to the complicated 
ordering factors involved in the production of the mean values, the MATLAB 
code that was used required looping codes with looping codes (creating double 
loops) . Therefore, these loops caused the running time for the Gibbs Sampler to 
be very large, especially for larger data sets. Alternate methods were eventually 
devised to decrease running times, like Fortran codes with MATLAB, but in 
the end the Gibbs Sampler was still simply not as fast as the other methods 
devised over this semester. Another factor was that of accuracy, even after 
careful reconstruction through the Fortran code, the accuracy of the Gibbs 
Sampler was only about half as good as that of the other methods examined 
both this semester and last semester. Perhaps the biggest reason behind why 
the Gibbs Sampler was never fully realized was that the code was one of several 
different algorithms that have been proposed for the Gibbs Sampler, some of 
which are much more advanced, complex, and less prone to errors than the 
one used in these experiments. The RMS values for the Gibbs Sampler would 
tend to converge to values twice that of the values in other methods used, so 
while the Gibbs Sampler could work, it wasn’t efficient in relation to the other 
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methods. Further study in the Gibbs Sampler would involve finding a quicker 
or more complex algorithm that would allow for more accurate results without 
sacrificing time efficiency. 


4 Covariance Functions / Kernels 

According to [Rasmussen and Williams, 2006] (for much of this section 
we follow the development in this source) a Gaussian process is a collection of 
random variables, any finite number of which have a joint Gaussian distribution. 
To characterize a Gaussian process it is critical to specify its covariance function. 
A covariance function is a function k(x,x') of two variables, sometimes called a 
kernel, with the property 

J k(x,x')f{x)f{x')dLi{x)di±(x') > 0 

for a suitable function f(x) and measure /i . Covariance functions character- 
ize the nearness or similarity between input data points. A covariance function 
k(x, x') gives the covariance of the values of the random held at the two locations 
x and x' . 

Let {xi\i = 1 n} be a set of input points. Then the Gram matrix 
or covariance matrix is defined as K whose entries are Kjj = k(xi,Xj). A 
Gram matrix corresponding to a general kernel function need not be positive 
semi-definite but the Gram matrix corresponding to the covariance function is 
positive semi-definite. 

We now summarize some of the features of commonly used covariance 
functions: 

The Squared Exponential (SE) covariance function is defined by the 
following equation: 


r 2 

kss(r) = exp(- — ) 

where £ is the length scale. This covariance function is infinitely differen- 
tiable and hence is very smooth. With such strong smoothness, it is sometimes 
unrealistic for use in modelling real physical processes. The squared exponential 
covariance function is also called the radial basis function. 


The Matern class covariance function is defined by the following equa- 
tion: 



U, r (~\ — 2 e v 2vr \v Tf / s/2vr \ 

Materny ' ) — T(v) \ £ / I *-v\ £ ) 

where v, l are positive parameters and K v is a modified Bessel function. 
The Matern class covariance function reduces to the Squared Exponential 
covariance function as v — > oo. It becomes simpler when v is a half integer. 
For example, if v = p + \ (where p is a non-negative integer), it reduces to a 
covariance function that is a sum of an exponential and a polynomial function: 


k Matern(v=p+ I)( r ) = eX P (“¥ ) 


r(p+l) (p+p! 

r( 2 p+l) 0 i\(j)—i)\ 



p—l 


The process becomes very rough for v = \ and for values of v > | , the 
function is as rough as noise. The Matern class covariance function is mean 
square differentiable k times if and only if v > k. Hence, the Matern class of 
covariance functions can be used to model real physical processes and is more 
realistic than the Squared Exponential (SE) covariance function. 

The Rational Quadratic (RQ) covariance function is defined by the 
following equation: 


k RQ(r) = ( 1 +^ 2 ) 


The Rational Quadratic (RQ) covariance function reduces to a Squared Ex- 
ponential (SE) covariance function as a — > 00 . This function is mean square 
differentiable for every a as opposed to Matern class covariance function. 

The polynomial covariance function, in its most general form, can be 
defined by the equation: 


k(x, x 1 ) = (ctq + x T Yi p x') p 


where E p is a positive semidefinite matrix and p is a positive integer. If 
Ctq = 0, the kernel is homogeneous and linear, and otherwise it is inhomogeneous. 
In principle it may not be suitable for regression problems as the variance grows 
with | x | for | x |> 1. However there are applications where it has turned out 
to be effective [Rasmussen and Williams, 2006]. 

The Neural Network (NN) covariance function is defined by the 
following equation: 
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This covariance function is named after Neural networks because the function 
can be derived from the limiting case of a model of a neural network. 

Two or more covariance functions can be combined to produce a new covari- 
ance function. For example sums, products, convolutions, tensor products and 
other combinations of covariance functions can be used to form new covariance 
functions. Details are described in [Rasmussen and William, 2006]. 

5 Low Rank Approximation 

Definition: A positive semi-definite matrix K is a low rank approximation 
of covariance matrix K if y* K — y is small, where y* K is the approximation 
to y* calculated with covariance matrix K and yt, is the approximation to y* 
calculated with covariance matrix K. 

It is difficult to calculate a low rank approximation of K using this def- 
inition directly. However, if K is sufficiently numerically stable and K — K is 

small, then K is a low rank approximation of K. Singular value decomposition 
and partial Cholesky decomposition may be used to find such a K. 


5.1 Singular Value Decomposition 

Covariance matrices are n x n, which for large n makes them impossible to 
calculate in a reasonable amount of time or store in memory. A possible solution 
is to instead calculate low rank approximations of these matrices. There is a 
well-known bound on how good these approximations can be. If ay < «2 < 
• • • < a n are the eigenvalues of any matrix K , and K is any rank m matrix, 
then K — K > a m+ 1 - The singular value decomposition of K can be used 

to find a K such that K — K = a m + 1 , proving K to be the best low rank 
approximation of K [Golub and Van Loan, 1996, p. 72]. Unfortunately, singular 
value decomposition is expensive in terms of time and memory, requiring 0(n 3 ) 
operations, and so it is not feasible with large matrices. Therefore, a more 
economic method is required. 
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5.2 Partial Cholesky Decomposition 

The partial Cholesky decomposition is a generalization of the Cholesky 
decomposition which may be calculated for a positive semi-definite nxn matrix 
K . For any m < n, there exists an n x m matrix V such that K = VV T is a 
rank m partial Cholesky decomposition of A', and it follows from this that K 
is a low rank approximation of K. Moreover, this decomposition is not unique 
many different possible choices of V exist. Rather than lower triangular as in 
the Cholesky decomposition, V is “lower trapezoidal” as illustrated below. 


Vll 

0 

0 

V 21 

V 22 

0 

V(jn— 1)1 

v (m- 1)2 

0 

^ml 

Vm2 

vmn 

V n l 

Vn2 

* 
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r vn v 2 i 

0 V22 

0 0 


^(m— 1)1 * * * Vn 

'U(m— 1)2 ^2m * ' * V2n 


0 'Vmn 


Vrin 


5.3 Partial Cholesky Decomposition With Pivoting 

Partial Cholesky decomposition with pivoting is an algorithm which is 
based on an algorithm to calculate the Cholesky decomposition, simply halted 
after m steps. It may be used to calculate a unique rank m partial Cholesky 
decomposition of K in 0(nm 2 ) operations. Pivoting need not be used, but with- 
out pivoting the decomposition is not unique and there is no theoretical bound 
on the inaccuracy of the approximation. If pivoting is used, then 1 1 K — PP T || 

is bounded by Ci = — m) (4 m + l)a m +i. [Higham, 2002, p. 204] In 

practice, this bound is not necessarily a good indication of the accuracy of the 
approximation, and usually ||Af — PP T || is well under 10a m +i [Higham, 2002, 
p. 207]. The algorithm for calculating the partial Cholesky decomposition with 
pivoting is given below. 


11 



5.4 Algorithm: Partial Cholesky Decomposition with Piv- 
oting 

Given a positive semi-definite K, this algorithm [Higham, 2002, p. 202] 
computes the partial Cholesky decomposition with pivoting algorithm. At each 
step let piv be a vector such that K (piv, piv) is a matrix in which its first to — 1 
diagonal entries equal 0, and its mth diagonal entry is max(diag(K)). Note the 
the output of this algorithm is not K , but V = i'Uij)- 


K = K 
for j = 1 : to 


K = K (piv, piv) 
for i = 1 : j — 1 



K = I< - VmVfn 


T. VkiVkj 
k = 1 


end 


The line K = K (piv, piv) is the pivoting step, and without this step the 
above algorithm would be the Partial Cholesky Decomposition without Pivoting. 


5.5 Pivoting 

Pivoting is useful in forming a numerically stable low rank approximation of 
a positive semi-definite matrix, and to do so it identifies the rows of the training 
data which limit the growth of computer arithmetic errors. A pivot of the 
matrix K, which is simply a permutation of K of the form PKP T corresponds 
to the permutation PX of X . The above partial Cholesky algorithm with 
pivoting will move columns and rows of K so that the to by to leading principal 
submatrix of PKP T has the condition number that is with a constant C 2 of 
ai/ctm- The proof follows from arguments similar to those in [Higham, 2002, 
pp. 195-208]. The constant C 2 is related to the constant C\ above. 

Thus pivoting will tend to construct a low rank approximation whose 
condition number is related to the condition number (ai/a m ) of the low rank ap- 
proximation produced by the singular value decomposition. However the growth 
of computer arithmetic errors in the algorithm depends on the condition num- 
ber of the low rank approximation. Since pivoting limits the condition number 
and the growth of computer arithmetic errors depends on the condition num- 
ber, pivoting will tend to improve the numerical stability of the algorithm. This 
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can, in principle, reduce the effect of computer arithmetic errors. If computer 
arithmetic errors are larger than the other errors (such as measurement errors 
and modelling errors) in the prediction of the redshift, then an algorithm incor- 
porating pivoting may potentially be more accurate than an algorithm without 
pivoting. 

6 V Formulation 

6.1 V Formulation without QR factorization 

Previously, to solve the original equation to predict the redshift, we used 
the Gaussian approach. This required forming an n x n covariance matrix K 
by using some pre-defined formula Kij = k(xi,Xj). 

If Xi* is taken to be the ith row of X* , then a p x n matrix K* can then 
be formed by the following formula: 


K *i = K x * » x i) 


Substituting these K and K* in the original linear equation: 


y* =K*(\ 2 I+K)~ 1 y 


But the size of (A 2 1 + A") -1 is n x n, and for large n (in our case n = 
180, 045), it is not practical to calculate (A 2 I+K)~ 1 directly. Inverting a matrix 
takes 0(n 3 ) floating point operations. When n is large, an 0(n 3 ) operation 
quickly becomes intractable. This problem may be by passed using simple linear 
algebra and the low rank approximations detailed in the previous section. 

To do this we will approximate K with VV T where V is produced by the 
partial Cholesky factorization. If pivoting was used in constructing V we will 
assume, for simplicity of presentation, that K is the covariance matrix after its 
rows and columns have been moved. Now K* is a p by n matrix. Let K J be 
the first m columns of K* and let V\\ be the m x m matrix consisting of the 
first to rows of V, where to < n. Also let V* = K*V-^ r . Now in addition 
to approximating K with VV T we also approximate K* with V*V T . Using 
these approximations we see that K*(\ 2 I + K)~ l y can be approximated by 
V*V T (X 2 1 + VV T )~ 1 y. We now have the following key Lemma: 
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Lemma: For A ^ 0, V T (X 2 1 + VV T )- 1 = (A 2 1 + V T V)~ 1 V T 

Proof: 

Note that 

X 2 V T + V T VV T = 

(A 2 I + V T V)V T = 

V T (X 2 I + VV T ). 

Since V T V and VV T are symmetric semidefinite (A 2 1 + V T V) and (A 2 1 + VV T ) 
are invertible. Multiplying on the left by (A 2 / + V T V)~ 1 and on the right by 
(A 2 1 + VV T )~ 1 , the lemma follows. ■ 

From this lemma, it follows that by substitution that: 


Corollary: y* = V*(X 2 I + V T V)~ 1 V T y 


This corollary is the basis of the V Formulation method. The matrix 
(A 2 1 + V T V)~ 1 is to x m rather than n x n, and so for small enough to, the 
equation is tractable, and in fact can be solved quite quickly. The flop count 
for solving this new equation is 0(nm 2 ). 

We will refer to the above method as the V method (or as the VF method 
in the labels of a few graphs). 

6.2 QR Factorization in V Formulation 

The condition number of V T V is the square of the condition number of 
V, and so if V has a condition number that is sufficient large, then the V 
formulation can potentially be numerically unstable. Therefore, it may be useful 

to make the equation more stable. To do this let A = ^ ^ ^ and let 

, where / is an to x to identity matrix and 0 is an to x 1 zero vector. 

Here A is an (n + in) x to matrix and b is a ( n + to) x 1 vector. Consider the 
least square problem: 

min || Ax — 6|| 

where the norm (|| ||) indicates the usual Euclidean norm. The normal equa- 
tions solution to this least squares problem is [Golub and Van Loan, 1996, p. 
237] x = (A T A)~ 1 A T b = (A 2 1 + V T V)~ 1 V T y. Therefore 

y* = V*(X 2 I + V T V)~ 1 V T y = V*x = K*V^ T x. 

However, it can be more accurate to solve a least squares problem using the 
QR factorization [Golub and Van Loan, 1996, p. 244]. In this approach [Golub 
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and Van Loan, 1996, p. 239] one first factors A = QR where Q is an ( n+m ) x m 
matrix with orthonormal columns and R is an to x in right triangular matrix. 

Then x = R~ 1 Q T b = R~ 1 Q T ^ ^ so that 

y* - - K\V-A T x = K*V 1 ~ 1 T R~ 1 Q T b = R~ 1 Q T ^ . 

With the above algorithm y* can still be solved quickly. The flop count for 
solving this new equation is 0(nm 2 ). This algorithm will have better numerical 
stability properties than the algorithm discussed in Section 7.1. 

We will refer to the V method using a QR factorization as the VQ method. 


7 Subset of Regressors 

7.1 Subset of Regressor without QR Factorization 

The subset of regressors method is similar to V Formulation in that an ap- 
proximation of the traditional Gaussian processes equation is formed. However, 
low rank approximation is not used directly, rather the matrices K, K* are 
partitioned into submatrices, only some of which are used in the calculation. 
Linear algebra can then be used to manipulate the original equation into an 
approximately equal equation that may be solved. 

Suppose there exists a n x to matrix V such that K = VV T (This is of 
course not possible if K is of rank greater than to, and that is why the solution 
given by this method is only approximate). Then: 

{,* = V*(X 2 I + VV T )- 1 V T y 


We partition these matrices as follows: 



K* = (K* K*) and V = 


R'i i is m x m. 
K' 2 i is (n — m) x m. 
K\ is n x m. 

K{ is p x m. 

Vn is to x TO. 
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It follows that: 


(£; £)-<* ^) = © w ' W-vm 

Therefore Ad = and Adi = VuVyl, and similarly K{ = I 7 * 

Assuming Vn £ j^nxm j g i nver tibl e we have KiV{[ t = V and klV{[ T = V* 
Substituting into y* = V*(X 2 I + VV T )~ 1 V T y yields: 


r = ^r^n T (A 2 /+ V^KlK^y'V^Klv 

V* = K{ [Fu(A 2 1 + V^KjK.V^V^Kly 
y* = ^[A a Vu^ + KlK^Kly 
y* = Kf[X 2 K n + Kf Ki]~ 1 K'[y 


Note that A 2 Adi + AT^A'i is an to x to matrix, so that this method is 
also 0{nm 2 ). Also note that V is not present in the final equation, so that it 
does not need to be calculated. For that reason, the subset of regressors method 
is the fastest of our methods. However, it is also the least stable. 

Above we developed the subset of regressor method by first presenting the 
V method. We should add that the subset of regressors method can be derived 
directly from y* = K*{X 2 I + K)~ l y. To do this approximate I\ and K* using 

K = K = Ad A'd 1 K? 

and 

K*^K* = A , A'd 1 !<J ■ 

Next in y* = K*(X 2 I + K)~ l y we replace K with K and K* with K* so that 

y* = K*(X 2 I + K) _1 y = 

AdA-d'Af (A 2 1 + K-J^Klr^y = 

K* K- 1 (A 2 / + Kj Ad A-d 1 ) - 1 Af y = 

K*q A 2 Adi + Ad T Ad)- 1 Ad T y. 

Also in this section note that if, in the subset of regressors formula, we 
substitute Ad = VV^ and Adi = Id i V'^ then the V formulation equations 
follow from the subset of regressor formula. 

Finally, we note that the above derivations show that the V formulation 
and the subset of regressors formulas are mathematically equivalent, assuming 
that the order of the columns and rows in K are the same when applying the 
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formulas. This is the case, for example, if the V formulation uses Cholesky 
factorization without pivoting to form V or if, assuming that pivoting is used 
in the Cholesky factorization, then the columns and rows of K are pivoted in 
the same manner prior to applying the subset of regressors formulas. Therefore 
the primary difference is in the numerical properties of the formulas, which we 
address shortly. 

We will refer to the subset of regressors method, as outlined above, as the 
SR method. 

7.2 QR Factorization in Subset of Regressors 

The subset of Regressors method using QR factorization requires only a 
slight modification to the Subset of Regressors method. As in the previous 
section, an approximation of the traditional equation is given by: 


y * = Kt[\ 2 K n + KfK^Kfy 

Now we factor the m x to matrix A '11 = tiiFn- The m x to matrix 
Vu can be calculated by Cholesky factorization of Ku. Now let A = ^ ^ 

and let b = ^ ^ ^ , where 0 is an to x 1 zero vector. Here A is an ( n + m) x to 
matrix and b is a (n + m) x 1 vector. Consider the least square problem: 

min 1 1 Ax — 6||. 


The normal equations solution to this least squares problem is [Golub and Van 
Loan, 1996, p. 237] x = (A T A)~ 1 A T b = (A 2 Vh V^ + Kf K^Kfy = (A 2 iQi + 
K-[ Ki)- 1 Kj y . Therefore 

r = Kl[X 2 K n + KjK-^Kjy = K*x 


However, it can be more accurate to solve a least squares problem using the 
QR factorization [Golub and Van Loan, 1996, p. 244]. In this approach [Golub 
and Van Loan, 1996, p. 239] one first factors A = QR where Q is an (n+m) x to 
matrix with orthonormal columns and R is an to x in right triangular matrix. 

Then x = R~ 1 Q T b = R _1 Q T ^ ^ ^ so that 


y* = K*x = K* 1 R~ 1 Q T b = iqRr 1 Q T 

With the above algorithm y* can still be solved quickly. The flop count for 
solving this new equation is 0(nm 2 ). 

We will refer to the SR method using the QR factorization as the SRQR 
method. 
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8 Numerical Stability 

8.1 Squaring the condition number 

Consider the subset of regressor method without QR (SR), the subset of regres- 
sor method with QR (SRQR) and the V formulation (V). They all provide a 
solution to 

min 1 1 Ax — 6|| 

where A = ^ yyT ^ and let 6 = ^ ^ ^ , where 0 is an mxl zero vector. Here 
A is an (n + in ) x m matrix and b is a (n + m) x 1 vector. 

SR: x = [A 2 A'n + K 1 \~ 1 K^y 

SRQR: x = R~ 1 Q t ^ ^ 

V: x = V 1 - 1 T (X 2 I+V T V)- 1 V T y. 

Here A = QR where Q is a ( in + n) x m matrix with orthonormal columns and 
R is a mx to right triangular matrix. Also I\\ = VVj 7 ) and Adi = Id \ Vy Y where 
V is an n x to matrix and the to x to matrix V\\ is the hrst to rows of V. 

The SR formula is just the normal equation solution to the above least 
squares problem and the SRQR formula is the QR solution. As indicated in 
[Golub and Van Loan, 1996, p. 244-245] there is potential for numerical insta- 
bility in using normal equations (since the error growth is always governed by 
the square of the condition number of A) whereas as use of the QR factorization 
is numerically stable. Here the phrase “numerical stability” is used to indicate 
that the algorithm produces a calculated answer that is close to the accuracy 
of the answer permitted by the condition number of the problem (see [Golub 
and Van Loan]). Therefore we should expect that the SRQR formulation will 
produce more accurate results than the SR formulation. 

To consider the V formulation we will begin by considering the special case 
where A = 0 and later consider the more general case. In the case that A = 0 
we have: 

V: x = V{[ T (V T V)~ 1 V T y. 

where K\ = V V j 7 ] . There is a potential concern in using this equation since to 
construct x the linear system of equations 

(V T V)z = V T y 

must be solved. Forming V T V squares the condition number of V which, poten- 
tially could lead to the introduction of undesirable computer arithmetic errors. 
However we will argue that the matrix B = V T V is diagonally equivalent to 
matrix that, in practice, is well conditioned and that this will limit the growth 
of computer arithmetic errors. 
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Now V is formed by a partial Cholesky factorization of the symmetric semi 
definite (SSD) matrix K. If pivoting is included in the partial Cholesky factor- 
ization of the SSD matrix it follows, for each i = 1 , ,m, that the i th diagonal 
entry is at least as large in magnitude as any off diagonal entry in row i or col- 
umn i [Trefethen, 1996, p. 176]. From properties of the Cholesky factorization it 
then follows that the lower trapezoidal matrix V has the property that, for each 

1 = 1 , ,m, the i th diagonal entry in V is at least as large in magnitude as any 
entry in column i. Therefore we can write V as V = WD where D is an mxm 
diagonal matrix and W is an nxm lower trapezoidal matrix with all entries one 
or less in magnitude and with ones on the diagonal. Indeed this matrix W is 
identical to the lower trapezoidal matrix produced if Gaussian elimination with 
partial pivoting is applied to K\ (and indeed, in this case, Gaussian elimination 
with partial pivoting will not pivot any entries). However it is known [Tre- 
fethen, 1997, p. 169] that, in practice, the lower trapezoidal matrices produced 
by Gaussian elimination with partial pivoting are well conditioned. 

Thus V is a diagonal rescaling of a well conditioned matrix W. Now it follows 
that we can write V = WD 1 D 2 where the entries of the diagonal matrix D\ are 
between 1 and 2 and where entries in D 2 are exact powers of 2. Since W is well 
conditioned then so is U = WD\ (since cond{WD\) < cond(W)cond(Di) < 

2 cond{W)). Now note, since V T V is symmetric positive definite, that algo- 
rithms for solving V T Vz = V T y do not require pivoting. However, in this case, 
scaling V T V by exact powers of two does not effect the numerical stability of the 
solution z to V T V z = V T y [Forsythe and Moler, 1967, pp. 37-46]. Therefore 
the growth of computer arithmetic errors in solving V T Vz = V T y is indicated 
by the condition number of U T U (the rescaling of V T V by exact powers of two). 
However since U has a modest condition number (since it is well conditioned in 
practice) so does U T U (since its condition number is the square of the condition 
number of U). 

The conclusion from the above argument is that we do not square the condi- 
tion number of a ill-conditioned matrix in the V formulation and that we expect 
that this will limit the error growth in using the V formulation. 

To consider the case that A / 0 we note that in this case the condition 
number of B — (A 2 1 + V T V) will be important in solving 

(A 2 I + V T V)z=V T y. 


However we have 

Theorem 1 For any A > 0, cond(X 2 I + V T V) < cond(V T V). 

Proof. If V T V has eigenvalues oq > «2 > • • • > OL m > 0 then the eigenvalues 
of (A 2 1 + V T V) are (A 2 + cq), i = 1, . . . , m. Therefore cond(V T V ) = a\/a m 
and cond(\ 2 I + V T V) = (oq + \ 2 )/{a m + A 2 ). However it follows easily that 
Oii/otm > (aq + A 2 )/(a m + A"). ■ 

Since cond(X 2 1+V T V) < cond(V T V ) we expect that solving (A 2 1+V T V)z = 
V T y with A ^ 0 will be more accurate than solving this equation with A = 0. 
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Since we have argued that the error growth in solving this equation for A = 0 
should be limited we expect that this should also be true when A yf 0. 

In summary, the above argument indicate that the SRQR and V formulations 
should have better numerical stability properties than the SR formulation. This 
will be supported by the numerical experiments that we present later. 

We should add that the SR formulation has the advantage that it can be 
more efficient and require less memory than the other approaches. We will see 
this in our later experiments and also see that the V formulation can also be 
quicker and require less memory than the SRQR method. 
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Therefore in this case Vu, Ki and A are all exactly singular and therefore 
have condition numbers = oo. Therefore an algorithm which used the first two 
columns of K, ignoring this singularity, would fail. 

Now consider a reordering of K , switching the second and third rows and 
columns. For K , the reordered K we have 

/ 1 0 0 
k= oio 
\ o o o 

If we consider a rank two approximation to K then it follows easily that 



and 


Vu = 


1 0 
0 1 


K\ = 


1 0 
0 1 
0 0 


A = 


Ki 

AVn 


After reordering the new matrices V\\, Ki and A are well conditioned. For 
example cond(Vii) = cond(K i) = 1. 

As mentioned in Section 5.5 the partial Cholesky factorization with pivoting 
can be used to reorder the columns of K in a suitable manner. 


8.3 Related methods 

The V formulation is related to a general least squares algorithm due to Peters 
and Wilkinson [Bjork, 1996, pp. 73-76]. Also the formulas in [Wahba, 1990, 
p. 136] are related to to our V formulation formulas. The formulas in [Wahba, 
1990, p. 136] were used to develop practical techniques for the computations 
related to generalized cross validation. 


9 Choice of Rank 

In using low rank approximation the choice of rank will affect the accuracy of the 
approximation. It may be impractical to repeat the computations for a variety 
of different ranks and is useful to have techniques to facilitate determination of 
the accuracy of a variety of low rank approximations. 

It is well known that there are efficient techniques to remove columns (down- 
date) and add columns (update) in solving general least squares problems [Bjork, 
1996, pp. 127-152]. These techniques can be adapted to calculations for all of 
our low rank approximation techniques (V, VQ, SRQR). For example one can 
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calculate the solution for a rank m low rank approximation and use these tech- 
niques to calculate the rmse (root mean square errors) of the approximations for 
low rank approximation for all ranks k, k = 1, . . . , to. This can be implemented 
efficiently and require only a small additional amount of computing time. We 
have included this option in our code. 


10 Optimization of Hyperparameters 

The hyperparameters for any particular covariance function K may be op- 
timized by minimizing the output of a function (f) which takes as input a vector 
of hyperparameters 9q and outputs y* Kl the root mean square error of the ap- 
proximation with respect to K . We used two available MATLAB functions, 
fminsearch.m and minimize. m, to accomplish this and find a good value of 9 to 
use. 


10.1 Nelder-Mead Simplex Minimization 

The standard MATLAB function fminsearch uses an initial estimate to min- 
imize a multi- variable function, which is known as unconstrained nonlinear opti- 
mization. The algorithm used by fminsearch is the Nelder-Mead simplex [Nelder 
and Mead, 1965] search method, which does not make use of gradients. This 
method becomes extremely slow as the number of hyperparameters increases, 
but it can output the true optimum values but it is possible for fminsearch.m 
to output a local minimum. 


10.2 Marginal Likelihood and Minimize 

Marginal likelihood p(y\X, 9) is the probability of observing a set of red- 
shift measurements in y given photometric filter observations X and a series 
of hyperparameters, stored as a vector 9. To better predict the values of red- 
shifts y, one can set X and y as fixed constants whose values are determined 
by previously-observed training data sets and maximize the marginal likelihood 
function, setting 9 as a variable. An “optimal” set of hyperparameters can 
then be found. One can alternatively determine optimal hyperparameters by 
maximizing the natural logarithm of the likelihood function, which according to 
[Rasmussen and Williams, 2006, p. 113] is: 


logp(y\X,6) = -\y T A x y - |logdet(A) - f log27r 
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We utilize minimize. m, a Matlab program that minimizes a multivariate 
differentiable function provided by Rasmussen [2006], to minimize the negative 
logarithm of the marginal likelihood function-an equivalent procedure to the 
maximization of the original marginal likelihood function-and thus determine 
optimal hyperparameters given each of the covariance functions under consider- 
ation. Computer memory issues have limited the minimization process to 2000 
data training points in X and y. 


10.3 Optimization Method 

The speed of minimize. m allowed us to use it to calculate good values 
for the vector of hyperparameters 6 for each method and covariance matrix 
at our disposal. The vector 6 was used to compare the ability of different 
method/covariance matrix combinations to predict y* . The function fmin- 
search.m can then be used with the combinations which seem to best predict y* 
in order to attempt to improve the accuracy. However, the new results found 
with fminsearch did not constitute a significant improvement. 

We should note that minimize is not tailored to any of the low rank approx- 
imation algorithms. To choose the hyperparameters using minimize one must 
selected a subset of the data (for example 500 galaxies). Minimize then uses 
the traditional Gaussian process formula in it optimization calculations. This 
procedure was followed in the examples presented in [Rasmussen and Williams, 
2006, p. 182-184], We also followed this procedure. Although it might be of 
interest to develop a method for selecting hyperparameters that is tailored to 
low rank approximations we did not do so in this project. 

With fminsearch there is the danger of overfitting since fminsearch, as we 
used it, makes use of y*. Our hypothesis was that the dataset of 180,045 inputs 
was sufficiently large to ensure that overfitting would not result in significantly 
different results for new datasets. This hypothesis was confirmed by splitting y* 
into two smaller datasets indexed by I\ and I 2 , where I\ and I 2 are distinct in- 
dexes containing 10, 114 numbers and 10, 115 numbers respectively (recall that 
y* contains 20,229 elements total). We used fminsearch to find optimized hy- 
perparameters, 8j 1 and dj 2 , corresponding to y*(Ii) and y*(l 2 ), respectively. 
The number \6i 1 — 8 j 2 |, if large enough, is clear evidence of overfitting in our 
optimization. However, in our case 6b, — 0/ 2 1 was not large enough to be sig- 
nificant in terms of our computations (on the order of 10 -3 or ICO 4 in each 
case), especially with regard to the V formulation and V formulation with QR 
factorization methods, which are relatively independent of the hyperparameters 
chosen. 

Our calculations with covariance functions made use of the covariance func- 
tions that are included with the software associated with [Rasmussen and Williams, 
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2006]. The hyperparameters that we calculated include A and other parame- 
ters that are detailed in Rasmussen and Williams software. We refer the reader 
to http://www.gaussianprocess.org/gpml/code/matlab/doc/ for a detailed def- 
inition of the hyperparameters. Some of Rasmussen and Williams’ covariance 
functions came with two versions. The ARD (for automatic relevance deter- 
mination) version had more hyperparameters the the simpler versions. Also 
Rasmussen and Williams find it convenient to use the logarithm of hyperpa- 
rameters as input to there routines. Below is a list of the logarithms (base e) 
of the hyperparameters used in our calculations. The are three sets: the first 
set determined by using minimize with 500 galaxies, the second determined by 
using minimize with 1000 galaxies and the third determined by using minimize 
with 2000 galaxies. 

Most of the runs reported later were done using the first set of hyperpa- 
rameters below. For the algorithms that are numerically stable (gprSRPPl, 
gprSRPPlb, gprSRPP2, gprSRPP3, and gprSRPP4 described in the Appendix) 
we did not find a significant change in the accuracy of our results if we we used 
the other sets of hyperparameters. However, we should note that for gprSRPP 
the choice of hyperparameters can be more important. For example gprSRPP 
often exhibits numerical instability when using the hyperparameters in the first 
set below (selected by using the minimize function with 500 galaxies). How- 
ever, the numerical stability of gprSRPP appears to be improved when using 
the other sets of hyperparameters. 

Hyperparameters: calculated using 500 galaxies 


Covariance Matrix 

log#! 

log # 2 

log # 3 

log # 4 

log # 5 

Rational Quadratic ARD 

1.2622 

0.7822 

1.3818 

1.6949 

2.0585 

Rational Quadratic 

0.9275 

-1.0322 

3.1650 

-3.8940 


Squared Exponential ARD 

0.9828 

0.4503 

1.0168 

1.2166 

1.4851 

Squared Exponential 

0.9144 

-1.0045 

-3.8963 



Matern 3 

2.9316 

0.3722 

-3.9180 



Matern 5 

1.7937 

-0.4499 

-3.8965 



Neural Net 

-1.1236 

0.0033 

-3.9939 



Polynomial, r = 2 

-5.2931 

-11.4373 

-3.9051 



Polynomial, r = 3 

0.0890 

-5.0957 

-3.8984 




Covariance Matrix 

log # 6 

log # 7 

log # 8 


Rational Quadratic ARD 

-0.3107 

1.7217 

-3.8937 

Rational Quadratic 




Squared Exponential ARD 

-0.8526 

-3.9216 


Squared Exponential 




Matern 3 




Matern 5 




Neural Net 




Polynomial, r = 2 




Polynomial, r = 3 
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Hyperparameters: calculated using 1000 galaxies 


Covariance Matrix 

log 9 l 

log0 2 

log<? 3 

log 0 4 

log 0.5 

Rational Quadratic ARD 

0.8850 

0.4836 

1.0141 

1.4115 

1.8591 

Rational Quadratic 

1.0946 

-0.9851 

0.2097 

-3.9430 


Squared Exponential ARD 

0.5588 

0.1978 

0.8059 

1.1540 

1.4546 

Squared Exponential 

0.6896 

-1.3313 

-3.9473 



Matern 3 

2.6163 

-0.0748 

-3.9601 



Matern 5 

1.5403 

-0.8202 

-3.9476 



Neural Net 

-2.9613 

-0.5618 

-4.1775 



Polynomial, r = 2 

-5.5591 

-11.4599 

-3.9182 



Polynomial, r = 3 

-1.3607 

-7.7182 

-3.9218 




Covariance Matrix 

log 0 6 

log0 7 

log 0 8 

Rational Quadratic ARD 1.8591 

-0.9770 

1.2229 

-3.9560 

Rational Quadratic 




Squared Exponential ARD 

-1.3919 

-3.9496 


Squared Exponential 




Matern 3 




Matern 5 




Neural Net 




Polynomial, r = 2 




Polynomial, r = 3 





Hyperparameters: calculated using 2000 galaxies 


Covariance Matrix 

log 0i 

log 0 2 

log 0 3 

log 0 4 

log0 5 

Rational Quadratic ARD 

0.3942 

0.4083 

0.6969 

1.0122 

1.3051 

Rational Quadratic 

0.7061 

-0.9527 

0.1628 

-3.9692 


Squared Exponential ARD 

0.1247 

0.2648 

0.6504 

0.9302 

0.9864 

Squared Exponential 

0.2725 

-1.3061 

-3.9765 



Matern 3 

1.8813 

-0.6291 

-3.9842 



Matern 5 

1.1055 

-0.9277 

-3.9724 



Neural Net 

-2.8870 

-0.5853 

-4.1261 



Polynomial, r = 2 

-3.7483 

-8.3263 

-3.8932 



Polynomial, r = 3 

0.6935 

-0.4289 

-3.9343 




Covariance Matrix 

log0 6 

log0 7 

log0 8 



Rational Quadratic ARD 

-1.0845 

0.8351 

-3.9720 



Rational Quadratic 






Squared Exponential ARD 

-1.2208 

-3.9639 




Squared Exponential 






Matern 3 






Matern 5 






Neural Net 






Polynomial, r = 2 






Polynomial, r = 3 






25 





11 Timings 


The tic toe commands in MATLAB were used to time different methods under 
the same conditions. The times shown in the table below represent the amount 
of time it took to calculate y*, the predicted redshift. (Note that the routines 
described in the appendix can also be used to calculate estimated variances. 
The timings below are only the times for calculating the predicted redshifts and 
do not include times to calculate the variances.) In terms of speed, the methods 
used this semester were at least as good as those from last semester. The 
following table lists the computing times when using the quadratic covariance 
function with rank = 21 and three different subsets of the entire dataset of 
180045 galaxies : 


Computing times for the quadratic covariance function 

with rank = 21 


Method \ Size of subset of the data set 

1000 

10,000 

180,045 

V Formulation (with pivoting) 

0.1410 

0.2340 

4.3280 

V Formulation (no pivoting) 

0.2340 

0.1560 

1.5000 

V Formulation (with pivoting) using QR 

0.5930 

0.1560 

2.0470 

Reduced Rank with Cholesky 

0.5620 

0.2650 

4.5940 

Reduced Rank 

0.0940 

0.1570 

1.5470 

Cholesky Update with Partial Cholesky 

0.3120 

0.4690 

10.2810 

Cholesky Update with QR Factorization 

0.1250 

0.3440 

6.5460 

Subset of Regressors (no pivoting) 

0.0150 

0.0460 

0.2660 

Subset of Reg. (no pivoting) using QR 

0.0470 

0.2500 

5.1250 

Subset of Regressors (with pivoting) 

0.0780 

0.2500 

4.3750 

Conjugate Gradient with Partial Cholesky 

0.7030 

0.8440 

21.3600 

Conjugate Gradient 

0.1410 

0.6720 

14.1090 

Gibb’s Sampler 

241.5460 

232.1250 

232.5150 


Using results in the above table we can picture several of the methods (SR 
and V) introduced in this semester’s project with several of the methods used 
in the first semester’s portion of the project (RR.- reduced rank, CU - Cholesky 
update, CG - conjugate gradient). This is shown in Figure 1. 

Note that the SR and V method are faster than the methods (RR, CU and 
CG) used in the first part of this project. The times of the V method and the 
RR method are close but it should be added that the RR method can be less 
accurate than the V method. 

We have done additional timings that compare the SR, V method without 
pivoting, SRQR, and the V method with pivoting (using the column and row 
order determined by the partial Cholesky factorization with pivoting). These 
results are shown in Figures 2 and 3. In the results reported in these figures and 
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time vs. method, quadratic kernel, rank = 21 



Figure 1: Times of two methods used in the current semester and three methods 
used in the previous semester. 


in later sections for the SR, V method without pivoting, and SRQR method we 
left the column and row ordering of the covariance matrix unchanged (no piv- 
oting). Also in all our results reported below and in later sections “V method” 
will refer to the V method with pivoting, unless we explicitly indicate no pivot- 
ing. Also the “VQ method” will refer to the using the QR factorization in the 
V method (see Section 6.2) with pivoting. In some cases we also did timings of 
the result of calculating the RMSE errors for all low rank approximations less 
than a specified rank. In the graphs we have used “history” to indicate these 
calculations. These calculations were done using the neural network covariance 
function and with a rank of 100 and also with a rank of 1000. 

We see from these plots that the extra computations involved in calculating 
the RMSE errors for all lower order approximations is small compared to the 
time for the initial calculations in computing the low order approximation. We 
also see that the SR method is the fastest, the V method without pivoting is 
next fastest and this is followed by the SRQR method. The V method with 
pivoting is slowest. 

We should add that part of the reason that the V method with pivoting is 
slower is that Matlab does not have a built in routine for the Cholesky factor- 
ization with pivoting. We therefore wrote the code to do this using Matlab’s 
interpreted code which is slower than compiled code. Although we did not do 
so, it is possible that one could write this code in a compilable language like For- 
tran or C (and then use Matlab’s MEX utility). Such code would significantly 
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Figure 2: Comparision of the times of four methods for the neural network 
covariance function with rank = 100 


reduce the run times for the V method with pivoting. 


12 Accuracy and Stability 

12.1 Bootstrap resampling results 

One possible way to minimize the error in our prediction is to simply use the 
entire training dataset of 180,045 broadband measurements to train each of our 
methods to predict future redshifts given a separate set of broadband “testing” 
points. A limitation of this approach is that essentially only one sample is 
available from which to gauge the accuracy of our methods in the general case. 
It may be the case that if different training data were introduced, the result 
would be quite different. Although we have a limited set of data, somehow we 
want to determine how the method will respond to new sets of data. 

Given the size of our training dataset, it is reasonable to assume that 
it is a representative sample of the galaxies in the known universe, so that the 
galaxies with the various properties contained in the dataset occur in the uni- 
verse at approximately the same frequency. If this hypothesis is correct, then 
a statistical inference technique called bootstrap resampling may be employed. 
Bootstrap resampling takes the data that we have as the total known “popula- 
tion,” from which we can randomly sample to generate new training datasets. 
The standard approach to bootstrap resampling is to sample with replacement. 
If the size of the sample is less than the size of the training set one could also 
use resampling without replacement. Using each of these bootstrap samples, 
all of size 180,045, we can test our methods by fitting a prediction model, and 
compare the errors between the predictions and the actual redshifts. In this way 
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Figure 3: Comparision of the times of four methods for the neural network 
covariance function with rank = 1000 


we no longer rest all of our inferences on a single result for each method; instead 
we have a range of values, similar to the situation of having multiple datasets 
from which to choose. The bootstrap method is implemented as follows. 

With each method, we generate a random sample of galaxy broadband 
measurements and corresponding redshifts from the entire 180,045 dataset, X; 
each sample is of size 180,045 and we adhere to the property of traditional 
bootstrapping of replacement, in which a galaxy can be selected more than 
once for a sample. We then run the model on a separate testing dataset, which 
we call X*, to predict the corresponding redshifts of each galaxy. These redshift 
predictions generated by the model are then compared to the actual observed 
redshifts y* by calculating the root mean square error (RMSE) \\y* — il*\\/y/p 
where the norm is the usual Euclidean norm and p is the number of entries in 
y* and y* . For each selected algorithm and covariance function we repeat the 
above process 100 times, choosing a different random sample from X for each 
of the 100 repetitions. 

Figures 4-7 show the results of the 100 bootstrap runs of both the V 
Formulation and Subset of Regressors method using several different covariance 
functions. The covariance function which performed best in these trials was the 
neural network covariance function. The 100 model runs are ordered from low 
to high according to RMSE. While the SR method matches the V method with 
QR factorization (the VQ method) in terms of accuracy for the best 40 runs, 
the second half of the runs produced steadily worse errors. We attribute this 
lack of stability to computer arithmetic issues inherent in the SR method. For 
the neural network covariance function, due to the instability of the SR method, 
its RMSE error is driven up to .0267. The V Formulation method is shown to 
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be both stable and accurate. For the neural network covariance function the 
RMSE error equal to .0215 with the V method and the variation over individual 
samples is less than .001. The covariance function which performed best with 
all methods was the Neural Net covariance function, as shown in figure 7. 

Comparing the bootstrap accuracy results of the SR and V methods 
with methods explored in the first semester of the project [Cayco, So, et. al., 
2007] such as Cholesky Update and Conjugate Gradient (figure 8), we find that 
the V Formulation provides an absolute improvement in both accuracy and sta- 
bility. The subset of regressors method is lacking in both accuracy and stability, 
especially when the covariance matrix used is of low rank. The quadratic kernel, 
for example, is of rank 21, and the SR method performs badly with the quadratic 
kernel as is clearly indicated in figure 8. However, when a larger rank is used 
such as a low rank approximation of the Neural Network covariance function 
with in = 100, the accuracy and numerical stability is greatly improved as in 
figures 4-7. 
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Figure 4: Bootstrap resampling with the Rational Quadratic covariance func- 
tions, rank = 100. 
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Figure 5: Bootstrap resampling with Squared Exponential covariance functions, 
rank = 100. 


31 




Figure 6: Bootstrap resampling with Matern class covariance functions, rank = 

100 . 

Neural Network Quadratic 




Figure 7: Bootstrap resampling with the Neural Net and Quadratic covariance 
functions, rank = 100. 
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Quadratic Kernel with Various Methods 



Figure 8: Bootstrap resampling for various other methods with the Quadratic 
kernel, rank = 100. The following notation is used in the figure - VF: V Formu- 
lation, VQ: V Formulation with QR Decomposition, SR: Subset of Regressors, 
RR: Reduced Rank, CU: Cholesky Update, CG: Conjugate Gradient 


The most accurate covariance function in Figures 4-7 was the neural net- 
work covariance function. The graph in Figure 8 provides more detail for this 
covariance function using the SR and V method with rank = 100. For compar- 
ison we also include in Figure 9 the result of the V method using the quadratic 
covariance function with rank — 21 (the mathematical rank of the quadratic 
covariance function is 21). Note that the graph clearly indicates that the inac- 
curacy of the SR method approximately 80 of the 100 samples and also indicates 
that the neural network kernel is more accurate than the quadratic kernel. 
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Error Comparison 


V Formulation vs. SR 



Figure 9: Bootstrap resampling for neural network kernel, rank =100, with the 
SR and V methods, and the Quadratic kernel, rank = 21, V method. 

12.2 The choice of rank 

As mentioned in Section 9 and Section 11 we can calculate the RMSE errors 
corresponding to all lower rank approximations with a modest amount of extra 
effort. This can be used to illustrate the effect of varying the rank of the low 
rank approximation. For example we can do bootstrap resampling with a low 
rank approximation with rank m — 1000 and also obtain the result of bootstrap 
resampling with any lower rank. This is pictured in Figure 10. 


Neural Net: rmse for various ranks of low rank approximation 



Figure 10: Bootstrap resampling with the neural network covariance function 
and ranks 100, 200, 500 and 1000 and, for comparison, the SR method with 
rank 500. 

We can also use these calculations to plot the median rmse error versus rank. 
This is pictured in Figure 11. 

For the neural network covariance function the above results indicate that 
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median rmse error versus rank 



Figure 11: The median rmse error versus rank. 


the approximations become significantly more accurate as the ranks increase 
from 100 to 200 and to 500, but that the improvement in accuracy is small for 
ranks larger than 500. Also note that Figure 10 again indicates the potential 
instability of the SR approach. 

We should note that due to memory limitations with our rank = 1000 com- 
putations, we reduced the training set size to 36009 for the above runs. Also 
note that in the above runs we used bootstrap resampling without replacement. 

12.3 The effect of pivoting 

Figure 12 compares the result of using the V method without pivoting and the 
V method with pivoting: 


Neural Net: rmse for various ranks of low rank approximation 



Figure 12: Comparison of the V method without pivoting and the V method 
with pivoting with the neural network covariance function and ranks of 100, 200 
and 1000. 

These results suggest that, for the redshift data, the effect of pivoting is 
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relatively small. The effect of pivoting is less than one percent in the results 
pictured in Figure 12. When using rank 100 low rank approximations the pre- 
dictions are slightly worse with pivoting. On the other hand for the rank 1000 
approximations the predictions are slightly better when pivoting is included. 
The reason for result like those pictured in Figure 12 deserves further study. 

12.4 The choice of the testing set size 

The training set for the redshift calculations is large - it consists of more than 
180000 galaxies. Is all this data necessary to develop a good model? 

We can explore this by repeating calculation using training sets of various 
sizes. The results of this calculation is summarized in the Figure 13. 


median rmse error vs. percent of 180045 galaxies used in model 



Figure 13: The effect of varying the training set size. 

This figure suggests that the entire data set is not needed to get good ap- 
proximations. Indeed after the training set size is approximately 36000 ( 20 % 
of 180045 ) the is little or no improvement in the accuracy of the approximation 
as the training set size is increased. Indeed, the accuracy of the predictions 
decreases a small amount as the size of the training set is increased beyond 
a minimum point for each rank pictured. Potentially this could be related to 
selection of hyperparameters, which, as discussed in Section 10, were selected 
using a small (500 for most of our runs) subset of the data. This topic also 
deserves further investigation. 

12.5 Comparison with Traditional Method 

Although it is not practical to use the traditional Gaussian processes method 
with all of the available training data, it is possible with a small subset of the 
training data. In the following graph, 4000 of the 180, 045 rows of X were 
selected randomly without replacement to form X(I) for some index I. The 
equation y* = /\*(A 2 / + A') _1 y was then used with K being the covariance 
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matrix of X(I). This process was repeated 100 times with different random 
indices (figure 14). 


Traditional Method with a Subset of X of Size 4000 



Model Number 


Figure 14: Bootstrap resampling (without replacement) with training sets con- 
sisting of 4000 galaxies. 

A subset of 4000 inputs is the practical limit given the available resources. 
With 4000 galaxies the accuracy is worse than that of the V Formulation with 
low rank approximations of ranks 200, 500 or 1000 (see figure 10). We can also 
redo the computations for training set sizes of less than 4000 galaxies to obtain 
the results in figure 15. 


Traditional Method with Training Sets of Increasing Size 



Figure 15: Error versus training set size. As the training set increases in size, 
the RMS seems to converge to 0.021. 

However, this is not to say that the traditional method does not show 
any promise. It is not necessary to choose the training set randomly, as was done 
in the tests graphed below. There may be a way to choose a particular training 
set which is ideal in some way, and in that case the traditional method (or the 
V formulation with dramatically increased rank) may be more effective. One 
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possible way to do this is to use the partial Cholesky decomposition to find an 
index of X which places the most linearly independent rows first. Unfortunately, 
the practical limit on the partial Cholesky decomposition on a 180, 045 x 5 
matrix is only m — 1000 or so, and so the effectiveness of this method may not 
currently be tested. 

12.6 Comparison with alternate approaches 

In [Way and Srivastava, 2006] redshift prediction are made with a variety of 
traditional and novel approaches. These approaches include traditional linear 
regression, traditional quadratic regression, artificial neural networks, an E- 
model approach and Gaussian processes using the quadratic kernel. See [Way 
and Srivastava, 2006] for details. In figure 16 we compare the results of our 
Gaussian process calculations with their results. The results labeled GP-V are 
our results using the neural network covariance function with rank 1000 using 
the V method with pivoting (and without using a QR factorization). The other 
results pictured are obtained from [Way and Srivastava, 2006] 


u-g-r-i-z 



Figure 16: Comparison of a variety of methods of redshift prediction. 

The primary conclusion from this figure is that our results are competitive 
with the best known alternative results. 


13 Outliers 

Removing outliers from the testing matrix X* may allow us to significantly 
lower the root mean square error of our prediction. Outliers may be calculated 
by sorting the vector y* — y* . The graphs of y* — y* so far indicate that there 
is a set of outliers which is constant relative to choice of method and kernel. 
In other words, it is so far evidenced that there is a set of outliers common 
to the general method of approximating y* as a Gaussian Process. The sorted 
graphs indicate that a significant set of outliers exist for each method and each 
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covariance function. In fact, if 10 percent of the outliers are removed from the 
testing data, the RMS error on the remaining points drops to nearly 0.016. It 
cannot be concluded from the sorted graphs that the set of outliers is invariant 
with respect to choice of covariance function or method. However, the unsorted 
graphs do indicate that the outliers are a common set of points regardless of 
which covariance function is chosen and which method is used. This is because 
the “spikes” in these graphs tend to overlap. 



Figure 17: Graph which shows the presence of outliers. 



Figure 18: Graph which shows the distribution of outliers. 


13.1 Characterizing Outliers 

Given the hypothesis that there is a constant set of outliers, they still may 
not be discarded unless they can be characterized, so that all inputs sharing a 
particular characteristic may be removed from the testing set. Let I be a vector 
corresponding to the index of the sorted y* — y* vector, with y* corresponding 
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to the Neural Net covariance matrix and V-formulation with QR factorization. 
The input columns of the testing data corresponding to u, g, r, i,z were each 
sorted by the index /, meaning that the outliers should correspond to the first 
and last thousand entries or so of each of these resorted columns. However, 
examining the columns one at a time does not reveal any pattern. It is possible 
that some sort of multivariate correlation analysis would be able to detect a 
more complicated pattern among these columns. Similarly, y* was resorted 
by the index /, and no pattern was readily apparent. This data can also be 
included in a more in-depth analysis. One approach that might help is to block 
groups of galaxies. It is also possible to view the characterization of outliers as 
a classification problem, and this idea is addressed later in this section. 



Figure 19: Columns of X * , y* and y* for the testing set. 

However, if the inputs of these graphs are blocked together, and the output 
is taken to be the norm of these blocks, then a pattern begins to emerge. 
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Figure 20: Columns of X *, y* and y* for the testing set in blocks of 20. 


A slight U-shaped pattern appears in the columns of X*. A linear 
correlation seems to exist in y* , although this has limited usefulness since y* is 
not available for characterization purposes. 
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Figure 21: Columns of X*, y* and y* for the testing set in blocks of 100. 
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A stronger pattern exists here, especially in columns 3, 4, and 5. 
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Figure 22: Columns of X* , y* and y* for the testing set in blocks of 505. 

The norm of a the set of 505 entries which all correspond to the outliers 
in the fourth column of X *, for example, is greater than 375 even if we take 20% 
of the set to be outliers. It may thus be reasonable to take this as a necessary 
condition for a set which contains only outliers. Moreover, it seems reasonable 
that for a set which contains a percentage a% of non-outliers, the norm must 
be at least d-°)375+(a)37i. 5 - 
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13.2 Classifying Outliers 


Calculating the outliers of a testing set X* can be viewed as a classification 
problem. In other words, given training data X and a vector y with the property 
that y{i) = 1 if 27 is an outlier, and y{i) = 0 if 27 is a non-outlier, and similarly 
with testing data X* and y* , calculate an approximation of y* . This can be 
recast as a regression problem by taking y = \y — y|, where y is an approximation 
of y, and y* = \y* — y*\- To calculate y it is thus necessary to take a subset of 
the training data, and use one subset as training data to approximate another as 
testing data. This is equivalent to solving the classification problem, because it is 
the index of y* that is used to define outliers and non-outliers. Therefore, the V 
Formulation can be used indirectly to classify the outliers. First, approximate y* 
as 7. It then follows that the index of the sorted 7 can be used to predict a set of 
outliers in X* (those highest in the index) , and after these have been removed the 
original regression problem may be solved. This results in a significant reduction 
in rms error if 10% of the data is removed, as shown in figure 13. This method 
may be further refined by recursively finding outliers in the approximated set of 
outliers, adding these back in, finding outliers in that set and taking them back 
out, and so on. 


Bootstrap Resampling with Outliers Removed 



Figure 23: Bootstrap resampling with outliers removed. 
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14 Appendix - Code 

Code implementing and illustrating the above ideas is available at 
www.math.sjsu.edu/~foster/camcos07/redshift.html. 

The easiest way to get started using the code is to download and unzip the 
zip file from the site, start Matlab (7.0 or higher), move to the appropriate 
folder and type either “demo_bootstrap” or “demoJiistory” to run one of the 
demonstration files. 

Code in the zip file is based on the code used in the text Gaussian Processes 
for Machine Learning by Rasmussen and Williams [Rasmussen and Williams, 
2006] (www.gaussianprocess.org/gpml/). So that the demonstrations are self 
contained and do not require that the user download additional code, the zip file 
contains a few functions that are copied directly from Rasmussen and Williams’ 
web site. 

Note that a number of the functions (gprSRPPO, gprSRPPl, gprSRPPlb, 
gprSRPP2, gprSRPP3, and gprSRPP4) are written to be compatible, including 
usage and parameter lists, with Rasmussen and Williams function gprSRPP 
which does low rank approximation. In Rasmussen and Williams’ code, exam- 
ples and demonstrations any call to Rasmussen and William function gprSRPP 
can be replaced by a call to one of these functions. There are additional op- 
tions available in our code that are not part of Rasmussen and Williams code. 
However one one does not need to use these. 

Below we we briefly describe each of the files available from 
www.math.sjsu.edu/~foster/camcos07/redshift.html. 

Additional descriptions for the files are contained in the code. 

Demonstration files: 

Bootstrap \ accuracy demo: demo_bootstrap.m - demonstration pro- 
gram comparing the numerical stability and accuracy of various algorithms for 
Gaussian process regression algorithms using low rank approximation and illus- 
trating bootstrap resampling. 

History demo: demo_history.m - demonstration program illustrating a fea- 
ture that allows efficient calculation of the accuracy of all low rank approxima- 
tions less than or equal to a specified rank when using gprSRPPl. (This feature 
is also available in gprSRPPlb, gprSRPP2, gprSRPP3, and gprSRPP4.) 

A data file: 

redshift_data.mat - contains the training set data X and y and the test- 
ing set data Xtest and ytest for the redshift problem. Also the logarithms of 
hyperparameters listed in Section 10 are included in the structure loghyper. 
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Functions: 


The SR method: gprSRPPO.m - Carries out approximate Gaussian pro- 
cess regression prediction using the subset of regressors (SR) or projected process 
approximation (PP) with the active set specified by the user. gprSRPPO is a 
minor change from the gprSRPP code provided by Rasmussen and Williams. 
gprSRPPO avoids repeating certain calculations. 

The V Method without pivoting: gprSRPP l.m - Carries out approxi- 
mate Gaussian process regression prediction using the subset of regressors (SR) 
or projected process approximation (PP) with the active set specified by the 
user. This version uses a variation of the Peters- Wilkinson approach for solving 
least squares problems to increase the numerical stability of the algorithm. 

The V method without pivoting (implemented using partial Cholesky 
factorization without pivoting): gprSRPPlb.m - Carries out approximate 
Gaussian process regression prediction using the subset of regressors (SR) or 
projected process approximation (PP) with the active set specified by the user. 
This version uses a variation of the Peters-Wilkinson approach for solving least 
squares problems to increase the numerical stability of the algorithm. The 
version uses a partial Cholesky factorization without pivoting to factor the 
columns of the covariance matrix specified by the user. This code is slower 
than gprSRPP 1 but requires less memory. 

The V method with pivoting: gprSRPP2.m - Carries out approximate 
Gaussian process regression prediction using the subset of regressors (SR) or 
projected process approximation (PP) with the active set selected by a partial 
Cholesky factorization with pivoting. This version uses a variation of the Peters- 
Wilkinson approach for solving least squares problems (and partial Cholesky 
factorization with pivoting) to increase the numerical stability of the algorithm. 

The SRQR method: gprSRPP3.m - Carries out approximate Gaussian 
process regression prediction using the subset of regressors (SR) or projected 
process approximation (PP) with the active set specified by the user. This ver- 
sion uses a QR factorization to increase the numerical stability of the algorithm. 

The V method with QR factorization (VQ): gprSRPP4.m - Carries out 
approximate Gaussian process regression prediction using the subset of regres- 
sors (SR) or projected process approximation (PP) with the active set selected 
by a partial Cholesky decomposition with pivoting. This version uses a QR 
factorization (and partial Cholesky with pivoting) to increase the numerical 
stability of the algorithm. 

Partial Cholesky: choLpart.m -This function does a partial Cholesky de- 
composition, with pivoting or optionally without pivoting, of the kernel matrix 


45 



K defined by a covariance function that follows Rasmussen and Williams style 
for covariance functions. 

Quadratic Covariance function: covQUADiso.m Quadratic covariance 
function with isotropic distance measure. Written in the style of Rasmussen and 
Williams covariance functions. 

Quadratic Covariance function for Automatic Relevance Deter- 
mination: covQUADard.m - Quadratic covariance function with Automatic 
Relevance Determination (ARD) distance measure. Written in the style of Ras- 
mussen and Williams covariance functions. 

Cubic Covariance function: covCUBICiso.m - Cubic covariance function 
with isotropic distance measure. Written in the style of Rasmussen and Williams 
covariance functions. 

Cubic Covariance function for Automatic Relevance Determina- 
tion: covCUBICarcl.m - Cubic covariance function with Automatic Relevance 
Determination (ARD) distance measure. Written in the style of Rasmussen and 
Williams covariance functions. 

Polynomial Covariance function: covPOLYiso.m - Polynomial covari- 
ance function with isotropic distance measure. Default degree is 4. Written in 
the style of Rasmussen and Williams covariance functions. 

Polynomial Covariance function for Automatic Relevance Deter- 
mination: covPOLYard.m - Polynomial covariance function with Automatic 
Relevance Determination (ARD) distance measure. Default degree is 4. Written 
in the style of Rasmussen and Williams covariance functions. 

Some utilities: 

sq_dist.m - a function to compute a matrix of all pairwise squared distances 
between two sets of vectors, stored in the columns of the two matrices, a (of size 
D by n) and b (of size D by m) . If only a single argument is given or the second 
matrix is empty, the missing matrix is taken to be identical to the first. NOTE: 
This version of sq_dist calls a FORTRAN mex file and will be more efficient 
than the sq_dist.m file supplied with Rasmussen and Williams code. 

sq_distf.dll - a dynamic link library file created by Matlab’s mex utility. It is 
called by sq_dist.m. This utility works for Matlab 7.0, 7.1 and 7.2 but has not 
been successfully tested for Matlab 7.4. 
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